Overview

Dataset statistics

Number of variables: 4
Number of observations: 36282
Missing cells: 2882
Missing cells (%): 2.0%
Duplicate rows: 5108
Duplicate rows (%): 14.1%
Total size in memory: 2.1 MiB
Average record size in memory: 62.1 B
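The percentages above follow directly from the counts. The short check below is only a sketch: it recomputes them from the per-column missing counts listed in the variable summaries of this report, and the variable names are illustrative.

```python
# Counts copied from this report (overview table and per-variable summaries).
n_rows, n_cols = 36282, 4
missing_per_column = {"q": 1468, "q_flag": 61, "w": 1353, "w_flag": 0}
duplicate_rows = 5108

# Missing cells are summed over all columns and divided by the number of cells.
missing_cells = sum(missing_per_column.values())
missing_pct = 100 * missing_cells / (n_rows * n_cols)

# Duplicate rows are expressed as a share of all observations.
duplicate_pct = 100 * duplicate_rows / n_rows

print(missing_cells, round(missing_pct, 1), round(duplicate_pct, 1))
```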

Variable types

Numeric: 2
Boolean: 2

Alerts

Dataset has 5108 (14.1%) duplicate rows [Duplicates]
q is highly overall correlated with w [High correlation]
w is highly overall correlated with q [High correlation]
q_flag is highly overall correlated with w_flag [High correlation]
w_flag is highly overall correlated with q_flag [High correlation]
q has 1468 (4.0%) missing values [Missing]
w has 1353 (3.7%) missing values [Missing]

Reproduction

Analysis started: 2023-04-17 15:43:45.937748
Analysis finished: 2023-04-17 15:43:52.715937
Duration: 6.78 seconds
Software version: pandas-profiling v3.5.0
Download configuration: config.json

Variables

q
Real number (ℝ)

HIGH CORRELATION
MISSING

Distinct: 1643
Distinct (%): 4.7%
Missing: 1468
Missing (%): 4.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 12.655559
Minimum: 0.693
Maximum: 281
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 1.6 MiB

Quantile statistics

Minimum: 0.693
5-th percentile: 2.44
Q1: 4.79
Median: 8.145
Q3: 14.6
95-th percentile: 38.335
Maximum: 281
Range: 280.307
Interquartile range (IQR): 9.81

Descriptive statistics

Standard deviation: 14.107686
Coefficient of variation (CV): 1.1147423
Kurtosis: 28.360781
Mean: 12.655559
Median Absolute Deviation (MAD): 4.055
Skewness: 3.9881842
Sum: 440590.63
Variance: 199.02682
Monotonicity: Not monotonic
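Several of these figures are derived from others in the same table. As a rough consistency check, and assuming the usual definitions (CV = standard deviation / mean, variance = standard deviation squared, IQR = Q3 - Q1, sum = mean times the number of non-missing values), the values reported for q can be recomputed as follows; the variable names are illustrative.

```python
# Values copied from the statistics reported for q.
mean, std = 12.655559, 14.107686
q1, q3 = 4.79, 14.6
minimum, maximum = 0.693, 281
n_valid = 36282 - 1468  # observations minus missing values

cv = std / mean                  # coefficient of variation
variance = std ** 2              # variance from the standard deviation
iqr = q3 - q1                    # interquartile range
value_range = maximum - minimum  # range
total = mean * n_valid           # sum of all non-missing values

print(round(cv, 5), round(variance, 3), round(iqr, 2), round(value_range, 3), round(total, 1))
```

Small differences in the last digits are expected because the inputs are already rounded in the report.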
Histogram with fixed size bins (bins=50)
Value                 Count   Frequency (%)
10.8                  239     0.7%
11.1                  208     0.6%
10.2                  189     0.5%
11.5                  186     0.5%
10.4                  185     0.5%
10.1                  184     0.5%
10.6                  176     0.5%
12.2                  172     0.5%
11.8                  172     0.5%
9.45                  167     0.5%
Other values (1633)   32936   90.8%
(Missing)             1468    4.0%
Value   Count   Frequency (%)
0.693   1       < 0.1%
0.767   1       < 0.1%
0.946   2       < 0.1%
1.03    1       < 0.1%
1.04    3       < 0.1%
1.13    8       < 0.1%
1.14    1       < 0.1%
1.19    1       < 0.1%
1.23    13      < 0.1%
1.24    1       < 0.1%
Value   Count   Frequency (%)
281     1       < 0.1%
252     1       < 0.1%
208     1       < 0.1%
203     1       < 0.1%
202     1       < 0.1%
201     1       < 0.1%
195     1       < 0.1%
185     1       < 0.1%
182     1       < 0.1%
181     1       < 0.1%

q_flag
Boolean

Distinct: 2
Distinct (%): < 0.1%
Missing: 61
Missing (%): 0.2%
Memory size: 1.6 MiB
Value       Count   Frequency (%)
False       31473   86.7%
True        4748    13.1%
(Missing)   61      0.2%

w
Real number (ℝ)

HIGH CORRELATION
MISSING

Distinct: 266
Distinct (%): 0.8%
Missing: 1353
Missing (%): 3.7%
Infinite: 0
Infinite (%): 0.0%
Mean: 104.52504
Minimum: 34
Maximum: 385
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 1.6 MiB

Quantile statistics

Minimum: 34
5-th percentile: 49
Q1: 76
Median: 101
Q3: 127
95-th percentile: 175
Maximum: 385
Range: 351
Interquartile range (IQR): 51

Descriptive statistics

Standard deviation: 39.1233
Coefficient of variation (CV): 0.37429597
Kurtosis: 1.3424243
Mean: 104.52504
Median Absolute Deviation (MAD): 25
Skewness: 0.85669362
Sum: 3650955
Variance: 1530.6326
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Value                Count   Frequency (%)
106                  453     1.2%
98                   448     1.2%
104                  441     1.2%
99                   434     1.2%
96                   425     1.2%
105                  424     1.2%
95                   424     1.2%
92                   420     1.2%
102                  412     1.1%
100                  409     1.1%
Other values (256)   30639   84.4%
(Missing)            1353    3.7%
Value   Count   Frequency (%)
34      1       < 0.1%
35      3       < 0.1%
36      4       < 0.1%
37      30      0.1%
38      52      0.1%
39      65      0.2%
40      83      0.2%
41      95      0.3%
42      134     0.4%
43      139     0.4%
Value   Count   Frequency (%)
385     1       < 0.1%
368     1       < 0.1%
342     2       < 0.1%
338     1       < 0.1%
337     1       < 0.1%
335     1       < 0.1%
325     2       < 0.1%
319     1       < 0.1%
315     1       < 0.1%
312     2       < 0.1%

w_flag
Boolean

Distinct: 2
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 1.3 MiB
Value   Count   Frequency (%)
False   30438   83.9%
True    5844    16.1%

Interactions


Correlations


Auto

The auto setting selects an interpretable pairwise metric according to the types of the two columns, using the following mapping:
  • Variable_type-Variable_type : Method, Range
  • Categorical-Categorical : Cramer's V, [0,1]
  • Numerical-Categorical : Cramer's V, [0,1] (using a discretized numerical column)
  • Numerical-Numerical : Spearman's ρ, [-1,1]
The number of bins used in the discretization for the Numerical-Categorical column pair can be changed using config.correlations["auto"].n_bins. The number of bins affects the granularity of the association you wish to measure.

This configuration uses the recommended metric for each pair of columns.
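The mapping above can be read as a small dispatch on the pair of column types. The function below is only an illustration of that table; `auto_metric` is a hypothetical name and is not part of the pandas-profiling API.

```python
def auto_metric(type_a: str, type_b: str):
    """Return (method, value range) for a pair of column types, per the mapping above."""
    kinds = {type_a, type_b}
    if not kinds <= {"Numerical", "Categorical"}:
        raise ValueError(f"unsupported column types: {type_a}, {type_b}")
    if kinds == {"Numerical"}:
        # Two numerical columns: rank correlation.
        return "Spearman's rho", (-1, 1)
    # Any pair involving a categorical column: Cramer's V
    # (a numerical partner is discretized into bins first).
    return "Cramer's V", (0, 1)

print(auto_metric("Numerical", "Numerical"))
print(auto_metric("Numerical", "Categorical"))
```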

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation and 1 indicating total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
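A dependency-free sketch of this calculation is shown below. It assumes ties receive the average of their ranks, and it uses the q and w values from the last ten rows of this report's Sample section as input; the helper names are illustrative.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Covariance divided by the product of standard deviations (1/n factors cancel)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: Pearson's r applied to the rank variables."""
    return pearson(average_ranks(x), average_ranks(y))

q = [6.27, 5.89, 6.31, 12.90, 14.70, 17.00, 23.20, 36.60, 39.40, 34.70]
w = [68.0, 66.0, 69.0, 102.0, 110.0, 119.0, 138.0, 171.0, 176.0, 168.0]
rho = spearman(q, w)
print(rho)
```

In these ten rows q and w rise and fall together, so ρ is 1 up to floating-point rounding even though the relationship is not a straight line.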

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, -1 indicating total negative linear correlation, 0 indicating no linear correlation and 1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
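The formula, together with the invariance under changes of location and scale mentioned above, can be illustrated with a few lines of plain Python. This is a sketch rather than the library's implementation; the input is the q and w values from the last ten rows of this report's Sample section, and the rescaling constants are arbitrary (positive scale factors).

```python
from statistics import mean

def pearson(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    # The 1/n factors of the covariance and of the standard deviations cancel.
    return cov / (sx * sy)

q = [6.27, 5.89, 6.31, 12.90, 14.70, 17.00, 23.20, 36.60, 39.40, 34.70]
w = [68.0, 66.0, 69.0, 102.0, 110.0, 119.0, 138.0, 171.0, 176.0, 168.0]

r = pearson(q, w)
# Shift and rescale each variable separately: r should not change.
r_rescaled = pearson([2 * v + 3 for v in q], [0.5 * v - 1 for v in w])
print(r, r_rescaled)
```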

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, -1 indicating total negative correlation, 0 indicating no correlation and 1 indicating total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
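The definition as stated corresponds to the variant usually called tau-a. A direct, quadratic-time sketch is given below; note that statistical libraries commonly compute the tau-b variant, which additionally corrects for ties, so results can differ on tied data. The function name is illustrative.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """(concordant pairs - discordant pairs) / total number of pairs."""
    concordant = discordant = 0
    pairs = list(combinations(range(len(x)), 2))
    for i, j in pairs:
        # A pair is concordant when x and y move in the same direction.
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
        # Pairs tied in x or y (s == 0) count as neither.
    return (concordant - discordant) / len(pairs)

# Of the 6 pairs, only the middle two observations are out of order.
tau = kendall_tau_a([1, 2, 3, 4], [1, 3, 2, 4])
print(tau)
```

Here five pairs are concordant and one is discordant, so τ = (5 - 1) / 6.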

Cramér's V (φc)

Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been shown to be biased, even for large samples. We use the bias-corrected measure proposed by Bergsma in 2013.
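For orientation, the bias-corrected estimator can be sketched in plain Python as follows. This follows the commonly cited form of Bergsma's correction and is not taken from the pandas-profiling source; the function name and the toy boolean columns are illustrative.

```python
from collections import Counter
from math import sqrt

def cramers_v_corrected(xs, ys):
    """Bias-corrected Cramér's V for two equally long sequences of labels."""
    n = len(xs)
    joint, rows, cols = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    # Pearson chi-squared statistic of the contingency table.
    chi2 = 0.0
    for a in rows:
        for b in cols:
            expected = rows[a] * cols[b] / n
            chi2 += (joint.get((a, b), 0) - expected) ** 2 / expected
    phi2 = chi2 / n
    r, k = len(rows), len(cols)
    # Bias correction of phi^2 and of the table dimensions.
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(k_corr - 1, r_corr - 1)
    return sqrt(phi2_corr / denom) if denom > 0 else 0.0

# Two identical flags (perfect association) and two unrelated flags.
same = [True] * 50 + [False] * 50
v_perfect = cramers_v_corrected(same, same)
v_independent = cramers_v_corrected([True, True, False, False] * 25,
                                    [True, False, True, False] * 25)
print(v_perfect, v_independent)
```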

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in case of a bivariate normal input distribution. Extensive documentation is available for it.

Missing values

A simple visualization of nullity by column.
Nullity matrix is a data-dense display which lets you quickly visually pick out patterns in data completion.
The correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another.
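One way to read the nullity correlation is as the Pearson correlation between per-column "is missing" indicators. The sketch below makes that assumption explicit; it does not reproduce the report's plotting code, and the two short columns are made-up toy data with `None` marking a missing value.

```python
from statistics import mean

def nullity_correlation(col_a, col_b):
    """Pearson correlation between the missingness indicators of two columns."""
    a = [1.0 if v is None else 0.0 for v in col_a]
    b = [1.0 if v is None else 0.0 for v in col_b]
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    if sa == 0 or sb == 0:
        # A column that is never (or always) missing has no defined correlation.
        return float("nan")
    return cov / (sa * sb)

# Hypothetical toy columns, not taken from the dataset.
q_toy = [None, None, 5.1, 6.2, None, 7.0]
w_toy = [None, None, 92.0, 99.0, 101.0, None]
nc = nullity_correlation(q_toy, w_toy)
print(nc)
```

A value near 1 means the two columns tend to be missing in the same rows; a value near -1 means one tends to be present when the other is missing.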

Sample

date         q     q_flag   w       w_flag
1922-09-01   NaN   NaN      109.0   False
1922-09-02   NaN   NaN      107.0   False
1922-09-03   NaN   NaN      104.0   False
1922-09-04   NaN   NaN      102.0   False
1922-09-05   NaN   NaN      100.0   False
1922-09-06   NaN   NaN      98.0    False
1922-09-07   NaN   NaN      97.0    False
1922-09-08   NaN   NaN      96.0    False
1922-09-09   NaN   NaN      95.0    False
1922-09-10   NaN   NaN      95.0    False
date         q       q_flag   w       w_flag
2021-12-22   6.27    True     68.0    True
2021-12-23   5.89    True     66.0    True
2021-12-24   6.31    True     69.0    True
2021-12-25   12.90   True     102.0   True
2021-12-26   14.70   True     110.0   True
2021-12-27   17.00   True     119.0   True
2021-12-28   23.20   True     138.0   True
2021-12-29   36.60   True     171.0   True
2021-12-30   39.40   True     176.0   True
2021-12-31   34.70   True     168.0   True

Duplicate rows

Most frequently occurring

       q      q_flag   w       w_flag   # duplicates
5095   NaN    False    NaN     False    1353
2503   7.78   False    105.0   False    103
2178   6.86   False    102.0   False    98
2890   9.09   False    109.0   False    98
2393   7.45   False    104.0   False    97
2603   8.09   False    106.0   False    96
2286   7.16   False    103.0   False    95
1830   5.92   False    99.0    False    93
2798   8.77   False    108.0   False    92
1504   5.10   False    92.0    False    90
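A table like this can be produced by counting identical rows and keeping those that occur more than once. The sketch below uses a handful of hypothetical rows shaped like the (q, q_flag, w, w_flag) records above, with `None` standing in for missing values so that rows with missing entries compare equal; it is not the report's own implementation.

```python
from collections import Counter

# Hypothetical rows in (q, q_flag, w, w_flag) order; None marks a missing value.
rows = [
    (None, False, None, False),
    (7.78, False, 105.0, False),
    (None, False, None, False),
    (6.86, False, 102.0, False),
    (7.78, False, 105.0, False),
    (None, False, None, False),
    (5.10, False, 92.0, False),
]

counts = Counter(rows)
# Keep only rows that occur more than once, most frequent first.
duplicates = sorted(
    ((row, n) for row, n in counts.items() if n > 1),
    key=lambda item: item[1],
    reverse=True,
)
for row, n in duplicates:
    print(row, n)
```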